Text Data Augmentation for the Korean Language


Abstract

Data augmentation (DA) is a universal technique to reduce overfitting and improve the robustness of machine learning models by increasing the quantity and variety of the training dataset. Although data augmentation is essential in vision tasks, it is rarely applied to text datasets since it is less straightforward. Some studies have addressed text data augmentation, but most of them are for majority languages, such as English or French. There have been only a few studies on data augmentation for minority languages, e.g., Korean. This study fills the gap by demonstrating several common data augmentation methods on Korean corpora with pre-trained language models. In short, we evaluate the performance of two text data augmentation approaches, known as text transformation and back translation. We compare these augmentations on Korean corpora across four downstream tasks: semantic textual similarity (STS), natural language inference (NLI), question duplication verification (QDV), and sentiment classification (STC). Compared to the cases without augmentation, the performance gains when applying text data augmentation are 2.24%, 2.19%, 0.66%, and 0.08% on the STS, NLI, QDV, and STC tasks, respectively.
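The abstract names text transformation as one of the two approaches but does not say which transformations were used. As an illustration only, the sketch below shows two common token-level transformations, random swap and random deletion. The function names, the whitespace tokenization, and the default parameters are assumptions made for this example and are not taken from the study; a real Korean pipeline would probably operate on morphemes rather than on space-separated units.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random pairs of positions swapped."""
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, rng=None, n_swaps=1, p_delete=0.1):
    """Produce two augmented variants of a sentence (hypothetical helper).

    Assumption: tokens are whitespace-separated units, which is a
    simplification for Korean.
    """
    rng = rng or random.Random()
    tokens = sentence.split()
    return [
        " ".join(random_swap(tokens, n_swaps, rng)),
        " ".join(random_deletion(tokens, p_delete, rng)),
    ]


if __name__ == "__main__":
    for variant in augment("나는 오늘 학교에 갔다", rng=random.Random(0)):
        print(variant)
```

Passing a seeded `random.Random` instance keeps the augmented set reproducible, which matters when comparing runs with and without augmentation.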


Similar resources

Data augmentation and language model adaptation

A method is presented for augmenting word n-gram counts in a matrix which represents a 2-gram Language Model (LM). This method is based on numerical distances in a reduced space obtained by Singular Value Decomposition (SVD). Rescoring word lattices in a spoken dialogue application using an LM containing augmented counts has led to a Word Error Rate (WER) reduction of 6.5%. By further interpol...


Korean Linked Data on the Web: Text to RDF

Interlinking data coming from different sources has been a long standing goal [4] aiming to increase reusability, discoverability, and as a result the usefulness of information. Nowadays, Linked Open Data (LOD) tackles this issue in the context of semantic web. However, currently most of the web data is stored in relational databases and published as unstructured text. This triggers the need of...


Improving Text Simplification Language Modeling Using Unsimplified Text Data

In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of t...


Developing a pattern based on speech acts and language functions for developing materials for the course “The Study of Islamic Texts Translation”

The aim of the present study is to propose a model based on speech acts and language functions for developing materials for the course "The Study of Islamic Texts Translation". To produce better and more engaging materials, the new model, unlike existing books, draws on Austin's (1962) levels of speech acts, Searle's (1976) classification of speech acts, and Halliday's (1978) language functions. For this purpose, 57 verses were randomly selected from different parts of the Quran...


Web Augmentation of Language Models for Continuous Speech Recognition of SMS Text Messages

In this paper, we present an efficient query selection algorithm for the retrieval of web text data to augment a statistical language model (LM). The number of retrieved relevant documents is optimized with respect to the number of queries submitted. The querying scheme is applied in the domain of SMS text messages. Continuous speech recognition experiments are conducted on three languages: Eng...



Journal

Journal title: Applied Sciences

Year: 2022

ISSN: 2076-3417

DOI: https://doi.org/10.3390/app12073425